1.0 - What is Data Science?

Extract information from data.

Example 1

Years of study / salary relation

  1. Understand the relation between an “input” and an “output”
  2. Find a function that roughly estimates the data points (called regression)
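A minimal sketch of what "finding a function that roughly estimates the data points" can look like. It assumes the function is a straight line fitted by least squares, which is only one possible choice. The name `fit_line` and the numbers are hypothetical and made up for illustration; they are not the course data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept (assumed model: a straight line)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two points and equally long lists")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the inputs and co-variation between inputs and outputs
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical, the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data: years of study (input) and salary in thousands (output)
years = [10.0, 12.0, 14.0, 16.0, 18.0]
salary = [24.0, 31.0, 35.0, 42.0, 48.0]

slope, intercept = fit_line(years, salary)
print(f"estimated salary ~ {slope:.2f} * years + {intercept:.2f}")
```

The fitted line does not pass through every point; it only estimates the overall trend, which is the sense of "roughly" in step 2.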

Other examples

3 Fields

  1. Supervised learning
  2. Unsupervised learning
  3. Semi-supervised learning

Terminology

step size = learning rate

Example: The income dataset

[Image: Pasted image 20241014124116.png]

The data points \((x_i, y_i) \in \mathbb{R}^2\) are assumed to be of the form \[y_i = f(x_i) + \epsilon \hspace{3mm} ( \epsilon \rightarrow \text{noise})\]

Remark: In general, the function \(f\) is unknown. We want to approximate this function \(f\) using the data!

Find an approximation \(\hat{f}\) of \(f\) using \(\{(x_i, y_i)\}_{1 \leq i \leq N}\).
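To make the model \(y_i = f(x_i) + \epsilon\) concrete, here is a small sketch that generates synthetic data points. Everything specific in it is an assumption for illustration: the "true" function `f_true`, the Gaussian noise, and the helper name `make_noisy_data`. With real data, \(f\) is unknown and only the pairs \((x_i, y_i)\) are observed.

```python
import random


def make_noisy_data(f, xs, noise_std, seed=0):
    """Return the pairs (x_i, y_i) with y_i = f(x_i) + eps, eps ~ N(0, noise_std^2).

    Gaussian noise is an assumption of this sketch; the notes only say "noise".
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [(x, f(x) + rng.gauss(0.0, noise_std)) for x in xs]


def f_true(x):
    # Hypothetical "true" relation, chosen only for this example
    return 20.0 + 3.0 * x


xs = [10.0 + 0.5 * i for i in range(21)]  # inputs x_1, ..., x_N with N = 21
data = make_noisy_data(f_true, xs, noise_std=2.0)

for x, y in data[:3]:
    # The difference y - f_true(x) is the noise term eps for that point
    print(f"x = {x:5.1f}   y = {y:7.2f}   noise = {y - f_true(x):6.2f}")
```

The task in the rest of the section is the reverse direction: given only `data`, build an approximation \(\hat{f}\) of the hidden function.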

Nearest neighbours interpolation

The nearest neighbours interpolation of \(\{(x_i, y_i)\}_{1 \leq i \leq N}\) is the function \[x \mapsto \hat{f}(x) = y_{i_x}\] where \(i_x \in \text{argmin}_{1 \leq i \leq N} |x - x_i|\).

This means that for an input \(x\) you find the nearest data point \(x_{i_x}\) (i.e. the one with the smallest absolute distance to \(x\)) and assign its corresponding value \(y_{i_x}\) to \(\hat{f}(x)\).
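The definition above can be sketched in a few lines of Python. The function name `nearest_neighbour_interpolation` and the toy numbers are hypothetical. When two data points are equally close, this sketch takes the one with the smaller index, which is one valid element of the argmin.

```python
def nearest_neighbour_interpolation(xs, ys):
    """Return f_hat with f_hat(x) = ys[i_x], where i_x minimises |x - xs[i]|."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")

    def f_hat(x):
        # i_x in argmin_i |x - x_i|; min() returns the first minimiser on ties
        i_x = min(range(len(xs)), key=lambda i: abs(x - xs[i]))
        return ys[i_x]

    return f_hat


# Hypothetical toy data: years of study and salary in thousands
years = [10.0, 12.0, 16.0, 20.0]
salary = [25.0, 30.0, 50.0, 80.0]

f_hat = nearest_neighbour_interpolation(years, salary)
print(f_hat(11.5))  # nearest x_i is 12.0, so this prints 30.0
print(f_hat(17.0))  # nearest x_i is 16.0, so this prints 50.0
```

Note that \(\hat{f}\) is piecewise constant: it reproduces every data point exactly, including its noise, and jumps halfway between neighbouring inputs.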

Next up: 1.1 Linear Regression